A Bayes Optimal Approach for Partitioning the Values of Categorical Attributes

Author

  • Marc Boullé
Abstract

In supervised machine learning, the partitioning of the values (also called grouping) of a categorical attribute aims at constructing a new synthetic attribute which keeps the information of the initial attribute and reduces the number of its values. In this paper, we propose a new grouping method, MODL, founded on a Bayesian approach. The method relies on a model space of grouping models and on a prior distribution defined on this model space. This results in an evaluation criterion of grouping, which is minimal for the most probable grouping given the data, i.e. the Bayes optimal grouping. We propose new super-linear optimization heuristics that yield near-optimal groupings. Extensive comparative experiments demonstrate that the MODL grouping method builds high quality groupings in terms of predictive quality, robustness and small number of groups.
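The abstract does not spell out the evaluation criterion. As a rough illustration of what "minimal for the most probable grouping" can look like, the sketch below scores a grouping by the negative log of (prior × likelihood). The form of the terms follows the MODL grouping criterion as recalled from the full paper and should be checked against it; the function names (`grouping_cost`, `stirling2_row`) and the data layout are hypothetical and chosen only for this example.

```python
from math import lgamma, log


def log_factorial(n):
    """log(n!) computed via the log-gamma function."""
    return lgamma(n + 1)


def log_binomial(n, k):
    """log of the binomial coefficient C(n, k)."""
    return log_factorial(n) - log_factorial(k) - log_factorial(n - k)


def stirling2_row(n):
    """Stirling numbers of the second kind S(n, k) for k = 0..n."""
    row = [1]  # S(0, 0) = 1
    for i in range(1, n + 1):
        new = [0] * (i + 1)
        for k in range(1, i + 1):
            prev_same_k = row[k] if k < len(row) else 0
            new[k] = k * prev_same_k + row[k - 1]
        row = new
    return row


def grouping_cost(groups, counts, n_classes):
    """Assumed MODL-style cost of a grouping (lower = more probable).

    groups: list of lists of attribute values (a partition of the values).
    counts: dict mapping each value to its list of per-class counts.
    """
    n_values = sum(len(g) for g in groups)
    n_groups = len(groups)
    # Prior on the number of groups and on the partition of the values:
    # B(I, K) = number of ways to split I values into at most K groups.
    stirling = stirling2_row(n_values)
    n_partitions = sum(stirling[1:n_groups + 1])
    cost = log(n_values) + log(n_partitions)
    for g in groups:
        per_class = [sum(counts[v][j] for v in g) for j in range(n_classes)]
        n_k = sum(per_class)
        # Prior on the class distribution inside the group.
        cost += log_binomial(n_k + n_classes - 1, n_classes - 1)
        # Likelihood of the observed class labels given that distribution.
        cost += log_factorial(n_k) - sum(log_factorial(c) for c in per_class)
    return cost


# Toy data: two values, two classes, one instance each.
toy_counts = {"a": [1, 0], "b": [0, 1]}
merged = grouping_cost([["a", "b"]], toy_counts, 2)
separate = grouping_cost([["a"], ["b"]], toy_counts, 2)
print(merged, separate)
```

On this toy data the merged grouping costs log 12 and the separate grouping costs log 16, so the criterion prefers a single group: with only two instances, the evidence does not justify the more complex model. The search over groupings (the optimization heuristics mentioned in the abstract) is not shown here.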


Similar resources

A Grouping Method for Categorical Attributes Having Very Large Number of Values

In supervised machine learning, the partitioning of the values (also called grouping) of a categorical attribute aims at constructing a new synthetic attribute which keeps the information of the initial attribute and reduces the number of its values. In the case of a very large number of values, the risk of overfitting the data increases sharply and building good groupings becomes difficult. In this ...


A Framework for Optimal Attribute Evaluation and Selection in Hesitant Fuzzy Environment Based on Enhanced Ordered Weighted Entropy Approach for Medical Dataset

Background: In this paper, a generic hesitant fuzzy set (HFS) model for clustering various ECG beats according to weights of attributes is proposed. A comprehensive review of the electrocardiogram signal classification and segmentation methodologies indicates that algorithms which are able to effectively handle the nonstationarity and uncertainty of the signals should be used for ECG analysis. Ex...


A New Probabilistic Approach in Rank Regression with Optimal Bayesian Partitioning

In this paper, we consider the supervised learning task which consists in predicting the normalized rank of a numerical variable. We introduce a novel probabilistic approach to estimate the posterior distribution of the target rank conditionally to the predictors. We turn this learning task into a model selection problem. For that, we define a 2D partitioning family obtained by discretizing num...


On Decision Boundaries of Naïve Bayes in Continuous Domains

Tapio Elomaa and Juho Rousu, Department of Computer Science, University of Helsinki, Finland, {elomaa,rousu}@cs.helsinki.fi. Abstract. Naïve Bayesian classifiers assume the conditional independence of attribute values given the class. Despite this in practice often violated assumption, these simple classifiers have been found efficient, effective, and robust to noise. Discretization of ...


A Divisive Ordering Algorithm for Mapping Categorical Data to Numeric Data

The amount of computing time for K Nearest Neighbor Search is linear in the size of the dataset if the dataset is not indexed. This is not acceptable for on-line applications with time constraints when the dataset is large. However, if there are categorical attributes in the dataset, an index cannot be built on the dataset. One possible solution to index such datasets is to convert categorical a...



Journal title:
  • Journal of Machine Learning Research

Volume 6  Issue 

Pages  -

Publication date 2005